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ABSTRACT 

A general inversion technique for the recovery of the underlying distribution function 
for observed galactic disks is presented and illustrated. Under the assumption that 
these disks are axi-symmetric and thin, the proposed method yields the unique distri¬ 
bution compatible with all the observables available. The derivation may be carried 
out from the measurement of the azimuthal velocity distribution arising from position¬ 
ing the slit of a spectrograph along the major axis of the galaxy. More generally, it may 
account for the simultaneous measurements of velocity distributions corresponding to 
slits presenting arbitrary orientations with respect to the major axis. The approach 
is non-parametric, i.e. it does not rely on a particular algebraic model for the distri¬ 
bution function. Special care is taken to account for the fraction of counter-rotating 
stars which strongly affects the stability of the disk. 

An optimisation algorithm is devised - generalising the work of Skilling & Bryan 
(1984) - to carry this truly two-dimensional ill-conditioned inversion efficiently. The 
performances of the overall inversion technique with respect to the noise level and 
truncation in the data set is investigated with simulated data. Reliable results are 
obtained up to a mean signal to noise ratio of 5 and when measurements are available 
up to 4i?e- A discussion of the residual biases involved in non parametric inversions is 
presented. Prospects of application to observed galaxies and other inversion problems 
are discussed. 

Key words: Inversion, Methods - galactic disks, equilibria - Non-parametric analy¬ 
sis, approximation, computational astrophysics, integral equations, ill-posed problems, 
numerical analysis 


1 INTRODUCTION 

In years to come, accurate kinematical measurement of 
nearby disk galaxies will be achievable with high resolution 
spectroscopy. Measurement of the observed line profiles will 
yield relevant data to probe the underlying gravitational na¬ 
ture of the interaction holding the galaxy together. Indeed 
the assumption that the system is stationary relies on the 
existence of invariants which put severe constraints on the 
possible velocity distributions. This is formally expressed by 
the existence of an underlying distribution function which 
specifies the dynamics completely. The determination of re¬ 
alistic distribution functions which account for observed line 
profiles is therefore required in order to understand of the 
structure and dynamics of spiral galaxies. 

Inversion methods have been implemented for spheroids 


(globular c luster s or elliptic al gal ax ies) b y Merrifield (1991), 
Deion ghe (1993|), Merritt (1996) (1997), Merritt & Trem¬ 
blay (1994)_^(97), Ems ellem, Monnet & Bacon (1994), 
Dehnen (1995), Kuijken (1995), and Qian ( 1995| ). Indeed for 
spheroids, the surface density alone yields access to the even 
component of a 2-integral distribution function which may 
account for the internal dynamics (while the odd component 
can be recovered from the mean azimuthal flow). However 
the corresponding recovered distribution might not be con¬ 
sistent with higher Jeans moments, since the equilibria may 
involve three (possibly approximate) integrals. The inver¬ 
sion problem corresponding to a flattened spheroid which 
is assumed to have 2 or 3 (Stakel-base d) int egrals has been 
addressed recently by Dejonghe et al. (1996) and illustrated 
on NGC 4697. Non parametric approaches have in particular 
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been used with success by Merritt & Gebhardt (1994) and 


Gebhardt & al. (1996) to solve the dynamical inverse prob¬ 


lem for the density in spherical geom etry. If the spheroid is 
seem exactly edge on Merritt (1996) has devised a method 
which allow them to recover simultaneously the underlying 
potential. 

Here the inversion problem for thin and round disks 
is addressed where symmetry ensures integrability. In this 
context, the inversion problem is truly two-dimensional and 
requires special attention for the treatment of quasi-radial 
orbits in the inner part of the galaxy. 

By Jeans’ theorem the steady-state mass-weighted dis¬ 
tribution function describing a flat galaxy must be of the 
form / = f(e, h), where the specific energy, e, and the spe¬ 
cific angular momentum, h, are given by 


e = + v^) — Ip , and h = Rv^. 


( 1 ) 


Here vh and are the star radial and angular velocities re¬ 
spectively of stars confined to a plane and tp{R) is the gravi¬ 
tational potential of the disk. The azimuthal velocity distri¬ 
bution, F^{R,v^), follows from this distribution according 
to: 


F^{R,v^)= / f{s,h)dvR, 


( 2 ) 


where the integral is over the region -b n^) < ibcm- 

responding to bound orbits. Pichon and Lynden-Bell (1996) 


demonstrated that, in the case of a thin round galactic 
disk, the distribution can be analytically inverted to yield 
a unique f{e, h) provided the potential 'ip{R) is known. The 
velocity distribution F^{R,v^) can be estimated - within a 
multiplicative constant - from line of sight velocity distribu¬ 
tion (LOSVD) data obtained by long slit spectroscopy when 
the slit is aligned with the major axis of the galactic disk pro¬ 
jected onto the sky. Similarly, the rotation curve observed in 
HI gives in principle access to the underlying potential. More 
generally, simultaneous measurements of velocity distribu¬ 
tions are derived with slits presenting arbitrary orientations 
with respect to the major axis as discussed in Appendix 

The inversion of Eq. (|^ is known to be ill-conditioned: a 
small departure in the measured data (e.g. due to noise) may 
produce very different solutions since these are dominated by 
artifacts corresponding to the amplification of noise. Some 
kind of balance must therefore be found between the con¬ 
straints imposed on the solution in order to deal with these 
artifacts on the one hand, and the degree of fluctuations con¬ 
sistent with the assumed information contents of the signal 
on the other hand (i.e. the worse the data quality is, the 
lower the informative contents of the solution will be and 
the more constraint the restored distribution will be so as 
to avoid an over-interpretation of the data ). Finding such a 
balance is called the “regularisation” of the inversion prob¬ 
lem (e.g. (Wahba & Wendelberger, 1979)) and methods im¬ 
plementing adaptive level of regularisation are described as 
“non parametric”. 

Under the assumption that these disks are axi- 
symmetric and thin, the proposed non parametric methods 
described in this paper yield in principle a unique distribu¬ 
tion: the smoothest solution consistent with all the avail¬ 
able observables, the knowledge of the level of noise in each 
measurement and some objective physical constraints that 
a satisfactory distribution should fulfill. 


Section g presents all relevant theoretical aspects of reg¬ 
ularisation and non parametric inversion for galactic disks 
distributions. Section ^ present the various algorithms and 
the corresponding numerical techniques which we imple¬ 
mented in steps to carry efficiently this two-dimensional 
minimisation. It corresponds i n ess ence to an extension of 
the work of Skilling & Bryan ( igS'^ ) for maximum entropy 
to other penalising functions which are more relevant in 
this context. All techniques are implemented in section 4 
on simulated data arising when the slit of the spectrograph 
is aligned with the long axis of the projected disk. A discus¬ 
sion follows. 


2 NON-PARAMETRIC INVERSION FOR 

FLAT & ROUND DISKS 

The non parametric inversion problem involves finding the 
best solution to Eq. (^ for the distribution function when 
only discretised and noisy measurements of F^{R,V 4 ,) are 
available. 

A distinction between parametric and non-parametric 
descriptions may seem artificial: it is only a function of how 
many parameters are needed to describe the model with 
respect to the number of independent measurements. In a 
parametric model there are a small number of parameters 
compared to the number of data samples. This makes the in¬ 
version for the parametric model somewhat regularised, i.e. 
well-conditioned. Yet, once the model has been chosen, there 
is no way to control the level of regularisation and the inver¬ 
sion will always produce a solution, whether the parametric 
model and its implicit level of regularisation is correct or not. 
In a non-parametric model, as a result of the discretisation, 
there is also a finite number of parameters but it is compa¬ 
rable and usually larger than the number of data samples. 
In this case, the amount of information extracted from the 
data is controlled explicitely by the regularisation. Here the 
latter non parametric method is therefore prefered since no 
particular unknown physical model for disks distributions is 
to be favoured. 


2.1 The discretised kinematic integral equation 

Since e is an even function of vr and since the relation be¬ 
tween vr and £ is one-to-one on the interval vr £ [0, oo) and 
for given R and Vt/,, Eq. (^) can be rewritten explicitly as: 

0 

F4R,v^) = V2 [ (3) 

J \/e + Y{R,v^) 

-Y(R,v^) 

where the effective potential is given by: 

Y{R,v^) = i;{R)-^vl. (4) 

For a given angular momentum h the minimum specific en¬ 
ergy is: 

( j 

emin(h) = min — - ■i/>(R) \ . (5) 

Re[o,oo) I 2R^ J 

From Eq. (^, the generic ill-conditioning of Eq. (|^ appears 
clearly since the integral relation relating the azimuthal ve- 
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locity distribution and the underlying distribution is an Abel 
transform (i.e. a half derivative). 

Given the error level in the measurements and the hnite 
number of data points A^data, f{£,h) is derived by htting 
the data with some model. Since the number of physically 
relevant distributions f{e,h) is very large, a small number 
of parameters cannot describe the solution without further 
assumptions (i.e. other than that the disk is round and thin). 
A general approach must therefore be adopted; for instance, 
the solution can be described by its projection onto a basis 
of functions {ek{e, h)-,k = l,..., N}: 


N 

f{e,h) = '^fk ekie,h) . 
k=l 


( 6 ) 


The parameters to fit are the weights fk- In order to fit a 
wide variety of functions, the basis must be very large; con¬ 
sequently the description of /(e, h) is no longer parametric 
but rather non-parametric. 

In order to account for the fact that the equilibrium 
should not incorporate unbound stars it is best to define the 
functions ek{£, h) of Eq. (P) so that they are identically zero 
outside the interval (e, h) G [£min(h), 0] x IR. It is convenient 
to rectify this interval while replacing the integration over 
specihc energy in Eq. (^) by an integration with respect to: 


rj = l- 


£ 

£min (h) ’ 


(7) 


and to use a new basis of functions: 


ek{'n,h) = efe((l-r?)£min(h), h) , 

which are zero outside the interval ( 77 , h) G [0,1] x IR. Here 
77 is some measure of the eccentricity of the orbit. Using this 
new basis functions, the distribution function becomes: 


N 

fiv, h) = f {{7-v) £min(h), h) ='^fk ikiv, h) ■ ( 8 ) 


Another important advantage of this reparametrisation is 
that the distributions /{r], h) can be assumed to be smoother 
functions along 77 and h since these distributions correspond 
to the equilibria of relaxed and cool system which have gone 
through some level of violent relaxation in their formation 
processes and where most orbits are almost circular. Note 
nonetheless that this assumption is somewhat subjective and 
introduces some level of bias corresponding to what is con¬ 
sidered to be a TOod distribution function as will be dis¬ 
cussed in section Clearly the assumption that the distri¬ 
bution function should be smooth (i.e. without strong gra¬ 
dients) in the variable 77 yields different constraints on the 
sought solution than assuming it should be smooth in the 
variable £. 

Real data correspond to discrete measurements Ri and 
of R and respective^. Following the non-parametric 
expansion in Eq. (^, Eq. (^ now becomes: 

N 

Fi,j = F^{Ri,V(f,j) ~ ^ fk , (9) 

k^l 


with 


fli, 


D 

\j 2£niini,j / 
j nr 


ek{v,RiV<l>j) , 

— , d77. 

V»7 - 


( 10 ) 
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where 

£min*,7 = £miiilRi rcj) j'j , — IT "F ^Ri ^ V(j) f) j -(11) 

The implementation of this linear transformation for linear 
B-splines is given in Appendix A. Since the relations between 
F^{R,vf) and /(? 7 , h) or f{r],h) are linear, Eq. (^ - the 
discretised form of the integral equation (|^ - can be written 
in a matrix form by grouping index i with index j: 

F = a-f. (12) 

The problem of solving Eq. (^) becomes a linear inversion 
problem. 


2.2 Maximum Penalised Likelihood 

In order to model a wide range of distributions f{rj, h) with 
good accuracy, the basis {efc( 77 ,/i); fc = 1 ,...,^^} must be 
sufficiently general (otherwise the solutions will be biased by 
the choice of the basis just as a parametric approach is biased 
by the choice of the model). The inversion should therefore 
be regularised and performed so as to avoid physically irrele¬ 
vant solutions. Indeed, being a distribution, /{rj, h) must for 
instance be positive and normalised. Finally, the inversion 
should provide some level of flexibility to account for the fact 
that the sought distribution might have a critical behaviour 
for some fraction of phase space such as that corresponding 
to radial orbits. It should also cope with incomplete data 
sets and should yield some means of extrapolation. 

In order to address these specihcities let us explore tech¬ 
niques able to perform a reliable practical inversion of this 
ill-conditioned problem and put the method brought for¬ 
ward in this paper into context. The Bayesian description 
provides a suitable framework to discuss how the practical 
inversion of Eq. ( p^ should be performed. 

2.2.1 Bayesian approach 

When dealing with real data, noise must be accounted for: 
instead of the exact solution of Eq. (^, it is more robust to 
seek the best solution compatible with the data and, possi¬ 
bly, additional constraints. A criteria allowing to select such 
a solution is provided by probability analysis. Indeed, given 
the measured data F, one would like to recover the most 
probable underlying distribution f. This is achieved by max¬ 
imising the probability of the distribution f given the data 
F, Pr(f I F), with respect to f. According to Bayes’ theorem, 
Pr(f I F), can be rewritten as: 

Pr(flF) = ^£m^, (13) 

Pr(F) 

where Pr(F| f) is the probability of the data F given that 
it should obey the distribution f, while Pr(F) and Pr(f) are 
respectively the probability of the data F and the probabil¬ 
ity of the distribution f. Since Pr(F) does not depend on f, 
maximising Pr(f | F) with respect to f is equivalent to min¬ 
imising: 


Q(f) = 

L{{)+fkR{{), 

(14) 

with 



L(f) 

= -aLog[Pr(F|f)]-be. 

(15) 

uR{f) 

= —Q:Log[Pr(f)] -b c , 

(16) 
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with a > 0 and where c and c' are constants which account 
for any contribution which does not depend on f. Minimis¬ 
ing the likelihood, L{f), enforces consistency of the model 
with the data while minimising -R(f) tends to give the “most 
probable solution” when no data is available as discussed in 


section 2.2.3 


2.2.2 Maximum likelihood 

Minimisation of L(f) alone in Eq. yields the maximum 
likelihood solution. The exact expression of —Log[Pr(F| f)] 
can usually be derived and depends on the noise statistics. 
For instance, assuming that the noise in the measured data 
follows a normal law, maximising the likelihood of the data 
is obtained by minimising the of tho data: 

-Log[Pr(Flf)l = ix"+c" with ^ ^ 

^ ^ Var(Fi.j) 


where Fij is the model of given by Eq. and Fij de¬ 
notes the measures of F^. Minimisation of x is known as 
Chi-square fitting. Throughout this paper and for the sake 
of clarity, Gaussian noise is assumed while defining the like¬ 
lihood term by: 


L(f) 




(Fij - hif 

Var(Fi.j) 


(17) 


(which incidentally corresponds to the choice a = 2 in 
Eq. (^^). In the limit of a large number of independent 
measurements, A^data, follows a normal law with an ex¬ 
pected value and a variance given by: 

Expect(x^) = Ndata , and Var(x^) = 2 Ndata • (18) 

It follows that any distribution, f, yielding a value of x^ in 
the range A^data ± -\/2 A^data, is perfectly consistent with the 
measured data: none of those distributions can be said to be 
better than others on the basis of the measured data alone. 

For a parametric description and provided that the 
number of parameters is small compared to A^data, the re¬ 
gion around the minimum of x^ is usually very narrow. In 
this case. Chi-square htting may be sufficiently robust to 
produce a reliable solution (though this conclusion depends 
on the noise level and assumes that the parametric model is 
correct). 

In a non-parametric approach, given the functional free¬ 
dom left in the possible distributions, it is likely that the 
value of the x^ can be made arbitrarily small, i.e. much 
smaller than A^data- Consequently, the solution which min¬ 
imises x^ is not reliable: it is too good to be true! In other 
words, solely minimising x^ iu a non-parametric descrip¬ 
tion leads to an over-interpretation of the data: due to the 
ill-conditioned nature of the problem, many features in the 
solution are likely to be artifacts produced by amplihcation 
of noise or numerical rounding errors. 


2.2.3 Regularisation 

Minimising the likelihood term forces the model to be con¬ 
sistent with some objective information: the measured data. 
Nevertheless, this approach provides no means of selecting 
a particular solution among all those which are consistent 


with the data (i.e. those for which Z/(f) = Ndata±\/2 Ndata)- 
Taking into account pi7(f) = —aLog[Pr(f)] -|- c” in Eq. ( [l^ 
yields a natural procedure to choose between those solutions. 
At least, there are some objective properties of the distri¬ 
bution f{ri,h) which are not enforced by Chi-square fitting 
(e.g. positivity) and which could be accounted for by the 
fact that Pr(f) must be zero (i.e. R{f) oo) for physically 
irrelevant solutions. 

Unfortunately, e.g. for noisy data, taking into account 
those objective constraints alone is seldom sufficient: addi¬ 
tional ad-hoc constraints are needed to regularize the inver- 
siou problem. To that eud, R{f) is geuerally dehned as a 
so-called penalising function which increases with the dis¬ 
crepancy between f and those subjective constraints. 

To summarise, the solution of Eq. (^ is found by min¬ 
imising the quantity Q(f) = L{f) -|- ^iR{f) where L{f) and 
R{f) are respectively the likelihood and regularisation terms 
and where the parameter fj, > 0 allows to tune the level of 
regularisation. The introduction of the Lagrange multiplier 
p, in Eq. (0) is formally justified by the fact that Q(f) should 
be minimised subject to the constraint that L(f) should be 
equal to some value, say Ng. For instance, with L(f) = X^(f) 
one would choose 

Ne £ [Adata “ V2 Adata, Adata + ^2 Adata] ■ 


2 . 2 . 4 . Definitions of the penalising function 


When data consist in samples of a continuous physical sig¬ 
nal, uncorrelated noise will contribute to the roughness of 
the data. Moreover, noise amplihcation by an ill-conditioned 
inversion is likely to produce a forest of spikes or small scale 
structures in the solution. As discussed previously, assuming 
that the “probability” Pr(f) increases with the smoothness 
of f{r], h), the penalising function should limit the effects of 
noise while not affecting (i.e. biasing) too much the range of 
possible shape of f{ri,h). To that end, the penalising func¬ 
tion i?(f) should be dehned so as to measure the roughness 
off. 

Many diherent penalising functions can be dehned to 
measure the ro ughness of f{jq,h)\ for instance minimising 
(Wahba , 1990): 


m = 


V"/-V"/ 


d77 dh , with V = 


dh 'drj 


(19) 


(where n > 1) will enforce the smoothness of f(r/, h). In the 
instance of a discretised signal for Eq. (^, such quadratic 
penalising functions can be generalised by th e use of a pos¬ 
itive dehnite operator K (Titterington, 1985): 


i?quad(f) , 


( 20 ) 


where stands for the transpose of f. 

Strict application of the Bayesian analysis implies that 
the penalising function R{f) is —Log[Pr(f)] (up to an addi¬ 
tive constant and the factor p) which is the negative of the 
entropy of f. This has led to the family of maximum entropy 
methods (hereafter MEM) which are widely used to solve 
ill-conditioned inverse problems. In fact MEM only differs 
from other regularised methods by the particular dehnition 
of the penalising function which provide p ositivity ab init io. 
A possible definition of the negentropy is (Skilling, 198S|): 
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-RMEM(f) = \fk Log— — fk + Pk] 

k 


( 21 ) 


where p is the a priori solution: the entropy is maximised 
when f = p. Although there are arguments in favour of that 
particular definition, there are many other possible options 
(Narayan fc Nityananda, 1986|) which lead to similar solu¬ 


tions. Penalising functions in MEM all share the property 
that they become infinite as f reaches zero, thus enforcing 
positivity. In ord er to further enforce the smoothness of the 
solution, Horne (1985) has suggested the use of a floating 
prior, defining p to be f smoothed by some operator S: 


p = S'f. (22) 

For instance, along each dimension of f{rj,h), the following 
mono-dimensional smoothing operator is applied: 

( (1 - 7)/i + ifi+i , if i = 1, 

Pi = S 7/i-i + (1 - 27)/i + ifi+i , if 1 < i < n, 

I (1 - 7)/i + 7/i-i , ifi = n, 

with 0 < 7 < 1/2 (here: 7 = 1/4); here i = l,...,n stands 
for the index along the dimension considered. This operator 
conserves energy, i.e. X]P ~ S /• 

The penalising functions iZquad(f) or iZMBM(f) with a 
floating prior p = S-f is implemented in the simulations to 
enforce the smoothness of the solution. 


2.2.5 Adjusting the weight of the regularisation 


Thompson and Craig (1992) compared many different objec¬ 


tive methods to fix the actual value of p. Generally speaking, 
these methods consists in minimising Q(f) given by Eq. ( p^ 
subject to the constraints L(f) = Ne where Ne is equivalent 
to the number of degrees of freedom of the model. Among 
those methods, two can be applied to non-quadratic penal¬ 
ising functions (such as the negentropy). 

The most simple approach is to minimise Q{f) subject 
to the constraint that L(f) = Expe ctfL(f)] = ATiata.. This 
yields an over-regularised solution (Gull, 1989) since it is 
equivalent to assuming that regularisation controls no de¬ 
grees of freedom. 


A second method is due to Gull (1989) who demon¬ 
strated that the Lagrange parameter should be tuned so 
that Q(f) = A^data, i.e. Ne = A^data — pR{f). In other words, 
the sum of the number of degrees of freedom controlled by 
the data and by the entropy is equal to the number of mea¬ 
surements. This method is very simpl e to implement but 
can lead to under-regularised solutions ( Gull, 19^ ; Thomp¬ 
son & Graig, 1992). Indeed if the subjective constraints pull 


f too far from the true solution then R(f) takes a high value 
as soon as any structure appears in f{rj,h). As a result, 
in order to meet Q{f) = Adata, the value of p is found to 
be very small by this procedure. For instance, this occurs 
in MEM methods when choosing a uniform prior p since a 
uniform distribution is very far from the true distribution. 
Nevertheless,this kind of prob lem was not encountered with 
a floating prior (Horne, 1985). In the algorithm described 
below this latter method (i.e. Gull plus Horne methods) is 
implemented to obtain a sensible value for p. 

Another potentially attractiv e way to find the value of 


U is the cross-validation method (Wahba & Wendelberger 


1971) since it relies solely on the data. Let Fip be the value 


at {i,j) of the model which fits the subset of data derived 
while excluding measurement {i,j) (in other words, Fij pre¬ 
dicts the value of the assumed missing data point Fij); since 
the fit is achieved by minimising Q{f), the total prediction 
error, given by: 


TPE = ^ 


[hi - Kjf 

Var(Fi,,0 


will depend on the sought value of p. The so-called cross- 
validation method chooses the value of p that minimise TPE. 
When the number of data points is large this metho d be¬ 
comes too CPU in tensiv e. Nonetheless Wahba (1990) and 
also Titterington (1985) provide efficient means of choos¬ 
ing p when the model is linear which involve constructing 
the so called generalised cross validation estimator for the 
TPE. 


3 NUMERICAL OPTIMISATION 

In the previous section it was shown that the inversion prob¬ 
lem reduces to the minimisation of a multi-dimensional func¬ 
tion Q(f) = I/(f) + pR{f) with respect to a great number of 
parameters (from a few 10“^ to 10®) and subject to the con¬ 
straints that (i) the likelihood term keeps some target value: 
I/(f) = Ne, (ii) all parameters remain positive and that (iii) 
special care is taken along some physical boundaries. Unfor¬ 
tunately there exists no general black-box algorithm able to 
perform this kind of optimisation. 

Let us therefore investigate in turns three techniques to 
carry the minimisation of increasing efficiency and complex¬ 
ity: direct methods, iterative minimisation along a single di¬ 
rection (accounting for positivity at fixed regularisation) and 
iterative minimisation with a floating regularisation weight. 


3.1 Linear solution 

Using quadratic regularisation, the problem is solved by 
minimizing: 

Oquad(f) = (F-a-f)^-W-(F-a-f)-|-Mf^-K-f . (23) 

where W is the inverse of the covariance matrix of the data. 
The solution fquad which minimises Qquad is: 

fquad = (a^-W-a-b^K)-La-L.W-F. (24) 

This solution, which is linear with respect to the data, is 
clearly not constrained to be positive. 


3.2 Non linear optimisation 

Linear methods only provide raw, possibly locally negative, 
solutions. At the very least, enforcing positivity of the so¬ 
lution - and more generally if the penalised function is not 
quadratic - requires non-linear minimisation. In that case, 
the minimisation of (3(f) must be carried out by successive 
approximations. 

At the n*® step, such iterative minimisation methods 
usually proceed by varying the current parameters f along 
a direction so as to minimise Q-, the new estimate of 
the parameters reads: 

j.(ra+l) _ j.(ra) ^(n) gf(n) 
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Figure 1. Fit of (as described in section y) with various penalizing functions. From top-left to bottom-right: (1) original and F^ 
restored by (2) MEM with uniform prior, (3) MEM with floating smooth prior and (4) quadratic regularisation, i.e. Rquad with n = 1 as 
defined in Eq. As expected, no significant difference is to be found in the fits, though in panel (2), F,^ is slightly rougher. The SNR 
is 50. 


where the optimum step size is the scalar: 
A*-"^ = arg{mm[Q(f^"^ -|- A5f^"^)]} . 


(26) 


The pri nblem being t o choose suitable successive direc t ions 
of mini miBation. - 


3.2.1 Optimum direction of minimisation 

In principle, the optimum direction of minimisation df could 
be derived from the Taylor’s expansion: 

Q{{ + 5f) ^ g(f) + ^ 5/, -P i ^ Sf.Sfi , (27) 




dfkdfi 


that is minimised for the step 

= -(wQ)“^-vg, 


(28) 


There exist a number of multi-dimensional minimisa¬ 
tion numerical routines that avoid the direct computation of 
the inverse of the Hessian matrix: e.g. steepest desc ent, con¬ 
jugate gradient algorithm, Powell’s method, etc. (Press et 


ah, 1988). For the steepest descent method, the direction of 


minimisation is simply given by the gradient: dfsD = —VQ. 
Other more efficient multi-dimensional minimisation meth¬ 
ods attempt to build information about the Hessian while 
deriving a more optimal direction, i.e. a better approxima¬ 
tion of —(VVQ)~^-Vg. For instance, the conjugate-gradient 
method builds a series of optimum conjugate directions 
5fcG, each of which is a linear combi nation of the cur - 
rent gradient and the previous direction ( |^ress et ah, 1988 ). 
Among those improved methods and when the number of 
parameters is very large, the choice of conjugate-gradient is 
driven by its efficiency both in terms of convergence rate 
and memory allocation. 


where VQ and VVQ are respectively the gradient vector and 
the Hessian matrix of g(f): 

VQ/c = 1^ , and WQ,,i = . 

dfk dfkdfi 

The whole difficulty of multi-dimensional minimisation deals 
with estimating the inverse of the Hessian matrix, which 
may generically be too large to be computed and stored. A 
further difficulty arises when Q(f) is highly non-quadratic 
(e.g. in MEM) since the behaviour of Q(f) can significantly 
differ from that of its Taylor’s expansion. 


3.2.2 Accounting for positivity 

Let us now examine the non-linear strategy leading to a min¬ 
imisation of g(f) with the constraint that f{ri, h) > 0 every¬ 
where. We will assume that the basis of functions {ek{ri, h)} 
is chosen so that the positivity constraint is equivalent to 
enforcing that fk > 0;VA: (see Appendix ^ for an example 
of such a basis). 

When seeking the appropriate step size given by 
Eq. (p^, it is possible to account for positivity by limiting 
the range of A*-"^: 
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Figure 2. Restoration of f{r}, h) from Fig. ^ (from top-left to bottom-right): (1) true distribution and distributions restored by (2) MEM 
with uniform prior, (3) MEM with smooth floating prior and (4) quadratic regularisation, i.e. Rquad with n = 1 as deflned in Eq. 
Note that MEM with a uniform prior yields a rather unsmooth solution, which is expected since no penalty is imposed by this method 
for lack of smoothness. 


c(n + l) 


> 0 4^ — min 


r{n) 

Jk 


< < - max 


n(n) 

Jk 




In practice this procedure blocks the steepest descent 
method long before the right solution is found. It is in fact 
better to truncate negative values after each step: 


f, 


(n-t-l) _ 


max{0,/f)+A('‘) 


Besides, any of these methods to enforce positivity breaks 
conjugate-gradient minimisation since this latter assumes 


that thle true minimum ot -|- is reached 

.in) -^^- 


while varying 

Thiebaut & Conan (1995) circumvent this difficulty 
thanks to a reparametrisation that enforces positivity. Fol¬ 
lowing their argument, Q is minimised here with respect to 
a new set of parameters x such as: 


fk = g{xk), with p : IR IR+ . 


(29) 


The following various reparametrisations meet these require¬ 
ments: 

fk = exp(xk) ^ g'ixkf = fl , 

fk = Xfe" (u positive integer) g'{xk)^ x . 

When Q(f) is quadratic, Q{f -I- Adf) is a second order 
polynomial with respect to A, the minimisation of which 
can trivially be performed with a very limited number of 
matrix multiplications. One drawback of the reparametri¬ 
sation is that, since g{-) is non-linear, Qog{x) is necessar¬ 
ily non-quadratic. In that case the exact minimisation of 


(5o(;(x -f A(5x) - mandatory in conjugate gradient or Powell’s 
methods - requires many more matrix multiplications. An¬ 
other drawback is that the direction of investigation derived 
by conjugate gradient or Powell’s methods may be no longer 
optimal requiring many more steps to obtain the overall so¬ 
lution. This latter point follows from the fact that these 
methods collect information about the Hessian while taking 
into account the previous steps, whereas for a non-quadratic 
functional this information becomes obsolete very so on since 
the Hessian (with respect to x) is no longer constant (Skilling 


& Bryan, 1984). 

Consequently, instead of varying the parameters x, 
we propose to derive a step for varying f from the 
reparametrisation that enforces positivity. Let 5x be the 
chosen direction of minimisation for x, the sought param¬ 
eters reads: g{xk + XSxk) — fk + XSxk g'{xk)- Identifying 
the right hand side of this expression with fk + XSfk yields: 
Sfk = Sxk g'{xk)- Using the steepest descent direction: 

dQ dQ , 
yielding finally: 

with 1 <v <2 depending on the particular choice of g{-). 
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Figure 3. Synopsis of the single direction minimisation algo¬ 
rithm.. Here ei ~ 10“® and £2 — 10“®. 


3.2.3 Algorithm for ID minimisation: positivity at fixed pL 

In Appendix we show that other authors have derived 
very similar optimum direction of minimisation but in the 
more restrictive case of a regularisation by I^mem. Note that 
our approach is not limited to this type of penalising func¬ 
tion since positivity is enforced extrinsically. In short, the 
minimisation step is derived from the unifying expression: 

= -qk^Qk , (31) 

where the gradient is scaled by (see Appendix : 

this work (with 1 < v < 2) , 
Richardson-Lucy, 
classical MEM, 

Cornwell-Evans. 

h- + fkY. vJiF-j) 

i,3 


Qk = < 


fk, 
fk , 
fk/p, 
fk 


3 . 2.4 Performance issues 

During the tests, it was found that conjugate-gradient 
method with reparametrisation and iterative methods with 
direction given by Eq. (E) require roughly the same number 
of steps (one step involving minimisation along a new direc¬ 
tion of minimisation). Yet the non-linear reparametrisation 
required to enforce positivity in conjugate gradient method 
prevents interpolation and in effect spends much more time 
(a factor of 10 to 20) to perform line minimisation. Besides, 
when the current estimate is far from the solution, minimisa¬ 
tion direction following our prescription (|^) or that of clas¬ 
sical MEM or Lucy spends fewer steps than that of Cornwell 
& Evans to bring f near the true solution. When the cur¬ 
rent estimate is sufficiently close to the solution, Cornwell 
& Evans’s method spends half as many steps as the other 
methods to reach the solution. The best compromise is to 
start with Sfk oc —fk'^Qk, then after some iterations use 
= 5fcE. As a rule of thumb, for low signal-to-noise ratios 
{SNR ~ 5) about as many steps as number of parameters 
are required, for high signal-to-noise ratios {SNR > 30) fewer 
steps are needed (up to 10 times less). It remains that with 
these methods trial and error iterations are required to hnd 
the appropriate value for p. The different irrmlementations 
are illustrated and compared in Fig. ^ and g as described 
in section (^. 

Accounting for positivity in multi-dimensional optimi¬ 
sation therefore leads to a modihed steepest descent algo¬ 
rithm for which the current gradient is locally rescaled. A 
faster convergence is achieved when some information from 
the Hessian is extracted appropriately. Yet, the above de¬ 
scribed algorithm assumes that optimisation is performed 
with a fixed value of the Lagrange parameter p. Let us now 
turn to a more general minimisation along several direction 
which allow pi to be adjusted on the fly during the minimi¬ 
sation. 


3.3 Minimisation along several directions 
3.3.1 Skilling & Bryan method revisited 


The scheme of the ID-optimisation algorithm is illus¬ 
trated in Fig. Iterations are stopped when the decrement 
in <3(1) becomes negligible, i.e. when: 

|Q(f + A<5f)-Q(f)| <£|Q(f)|, 


where e > 0 is a small number which should not b e smaller 
than the square root of the machine precision (Press et 
al., 198^. Lucy ( 199^ ) has suggested another stop criterion 
based on the value of the ratio: 


P=Pf|l/(l|5fL|| + ||5fR||), 

where and are the directions which minimise the 
likelihood and the regularisation terms: 

5fL = — q X VL and 5fR = — /rq x NR , 

where x denotes element-wise product (in other words q 
stands loosely for Diag( 5 i,..., qn)). In practice and regard¬ 
less of the particular choice for q, the algorithm makes no 
significant progress when p becomes smaller than 10“®. 


In the context of max imum entropy image restoration. 
Skilling & Bryan (1984) (hereinafter SB) have proposed a 
powerful method which is both efficient in optimising a non¬ 
linear problem with a great number of parameters and able 
to vary automatically the weight of regularisation so that 
the sought solution satisfies Z/(f) = AL. Here their approach 
is further generalise to any penalising function. In short, SB 
derive their method from the following remarks: 


(i) To account for positivity, they suggest an appropriate 
“metric” (or rescaling) which is equivalent to multiplying 
each minimisation direction by q = f. 

(ii) The regularisation weight p is adjusted at each itera¬ 
tion to meet the constraint L(f) = AL. Therefore, instead of 
minimising along the single direction —q x (VL -|- pNR), at 
least two directions are considered: —q x VL and —q x NR. 

(iii) Since the Hessian is not constant - at least since p 
is allowed to vary, no information is carried from the pre¬ 
vious iterations. This clearly excludes conjugate-gradient or 
similar optimisation methods, but favours non-quadratic pe¬ 
nalising functions for which the Hessian is not assumed to 
be constant. 
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(iv) If the whole Hessian cannot be computed, it can nev¬ 
ertheless be applied to any vector e of same size as f in a 
finite number of operations, e.g. two matrix multiplications 
for the likelihood term: VVL • e = 2a^-(a'e) (where, for 
the sake of simplicity, the diagonal weighting matrix was 
omitted here). They illustrate how this provide a means to 
include some knowledge from the local Hessian while seeking 
the optimum minimisation direction. 


3.3.2 Local minimisation sub-space 

In order to adjust the regularisation weight, at least two 
simultaneous directions of minimisation should be used: 
— q X VL and —q x VR. Furthermore, the local Hessian pro¬ 
vides other directions of minimisation to increase the con¬ 
vergence rate. Using matrix notation, the Taylor expansion 
of Q{f) for two simultaneous directions dfi and 5^2 reads: 

Q(f+<5fi-b5f2) g(f)+5f^Vg-t 

[vg -b Wg-5fi] -b Wg-(5f2 , 

where the Hessian and gradient are evaluated at f. Given a 
first direction 5fi, the optimal choice for a second direction 
is: 

5f2,opt = -(wg)"^-(vg -b wg-5fi). 

In MEM, recall that positivity is enforced explicitly by the 
regularisation penalty function while efficient minimisation 
methods rely on the approximation of (Wg)“^ by a scaling 
vector q. The optimum first two directions then become in 
MEM: 


5fi,opt ~ —q X vg , and (5f2,opt ~ —q x (Vg -b Wg-(5fi), 

Since the first term in the right hand side expression of 
5f2,opt is dfi.opt, the two near optimum directions sought 
are finally: 

5fi = — q X vg , and 5^ = —q x (VVg-5fi). (32) 

Similar considerations yield here further possible directions: 
5f„ = -qx (VVg-5f„-i). (33) 


If the rescaling, q, provides too good an approximation of 
the inverse of the Hessian, then 5fi and 5^2 would almost 
be identical (i.e. antiparallel); hence using only one is suf¬ 
ficient. In other words, since the local Hessian is accounted 
for by the use of additional directions of minimisation, there 
is no need that Diag(q) be an accurate approximation of 
(VVg)“^. The crude rescaling given by Eq. (Bl) is there¬ 
fore sufficient, i.e. taking: q = f. This definition of q has the 
further advantage to warrant positive values of f and does 
not depend on the actual value of p (which is obviously not 
the case for the Hessian). 

If no term in g(f) enforces positivity, it was shown ear¬ 
lier that the reparametrisation ( P9|) would. From the Taylor 
expansion of g(x), the first two steepest descent directions 
with respect to the parameters x are given by: 


Sxi^k 
5X2,k 


dQ 

dxk 


-g'{xk)VQk , 


(34) 



d^Q dQ 
dxkdxi dxi 


-g'{xk) ^ V^Qk,ig'{xi)5xi,i. 

I 
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Since Sfk — g'{xk)5xk, the near two optimum directions of 
minimisation for the parameters f read: 

5fi,k = -gfxkf^Qk, (35) 

5f2,k = -g'jxkf y^Qk,i5fi,i , (36) 

i 

which are incidentally identical to those given by Equa¬ 
tion (^, provided that qk = g'{xk)^. 

For all the regularisation penalising functions consid¬ 
ered here, clearly the best choice is to use directions given 
by the Hessian applied to Eq. ( ^ ) when other directions of 
minimisation than those related to the gradient are consid¬ 
ered. Since g can vary, the Hessians of L and R have to be 
applied separately. At each step, the minimisation is there¬ 
fore performed in the n = 3 x 2 = 6 dimensional sub-space 
defined by: 

(5fi = —q X VL , 5^2 = —q X VR , 

dfs = —q X (VVLAfi), 5f4 = —q x (VVL-5f2), (37) 

^fs = —q X (VV7?-5fi), 5f6 = —q x (VVi?-5f2), 

where 

Qk = fk with 1 <o <2. (38) 

When 1 / = 1, q is the same metric as that introduced by SB 
while relying on other arguments. Depending on the actual 
expression for VVi? (and in particular in MEM with a con¬ 
stant prior), a smaller number of directions need be explored 
(for instance, SB used only n = 3 simultaneous directions 
since when q = f, VVJ? = 1/f so = 5fi, Jfe = 5 ^2', 
they also use a linear combination of Jfs and 5f4). In this 
n-dimensional sub-space, a simple second order Taylor ex¬ 
pansion of Q{f + 'Yfi=i shows that the optimum set of 

weights sought, {Ai,..., An}, is given by the solution of the 
n linear equations parametrised by g and given by: 


XjSDj-i^L -b g\/VR)-S{i = -SDi-{\/L -b gVR). (39) 
t=i 


Now in that sub-space, the optimisation may be ill- 
conditioned (i.e. the set of linear equations are linearly de¬ 
pendent in a numerical sense). In order to deal with this de - 
generacy truncated SVD decomposition (Press et ah, 198J) 
is used to find a set of numerically independent directions. 
In practice, the rank of the 6 linear equations varies from 2 
(very far from the solution or when convergence is almost 
reached) to typically 5 or 6. This method turns out to be 
much easier to implement than the bi-diagonalisation sug¬ 
gested by SB. 


3.3.3 On the fly derivation of the regularisation weight 

At each iteration a strategy similar to that of SB was 
adopted here to update the value of g'. 

(i) Lmin and Lmax are the values of the likelihood term in 
the sub-space in the limits g —> 0 and /r —> oo respectively. 
The corresponding solutions give what we call the maximum 
likelihood solution and the maximum regularised solution in 
the sub-space. 

(ii) If Lmax < Ne the maximum regularised solution cor¬ 
responding to Lmax is adopted to proceed to the next itera¬ 
tion. Otherwise, in order to avoid relaxing the regularisation 
and following SB, a modest reachable goal is fixed: 
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Laim — mflxliVe) (1 Lprev C^^min} 


where Lprev is the likelihood value at the end of the pre¬ 
vious iteration, while 0 < a < 1 (say a = 2/3). A simple 
bi-section method is ^plied to seek the value of ^ for which 
the solution of Eq. (p9|) yields L = iaim- 

Following this scheme, the algorithm varies the value of /r 
so that at each iteration the likelihood is reduced until it 
reaches its target value; then the regularisation term is min¬ 
imised while the likelihood remains constant. 

As a stop criteria, a measure of the statistical discrep¬ 
ancy between two successive iterations: 


E r(n) r(n + l) _ r(n) 

Jk Ik Ik 

k 


k 


is computed. In practice, in order to avoid over- 
regularisation, Ne is taken to be Adata — \/2Adata. The cor¬ 
responding scheme of the n-dimensional optimisation algo¬ 
rithm is illustrated in Fig. 


3 . 3.4 Performance & assessments 

The optimisation of Q(f) in a multi-dimensional sub-space 
yields many practical advantages: (i) it provides faster con¬ 
vergence rates (about 10 times less overall iterations and 
even much less when accounting for the number of inver¬ 
sions required to derive the regularisation weight) and less 
overall CPU time in spite of the numerous matrix multiplica¬ 
tions involved to compute the n directions of minimisation 
and their images by the Hessians, (ii) It yields a more ro¬ 
bust algorithm because it is less sensitive to local minima 
and also because the routine requires less tuning, (iii) Since 
fj. varies between iterations and since the local Hessian is 
always re-estimated, the solution can be modified on the fly, 
e.g. rescaled, without perturbing the convergence. Hence the 
normalisation is no more an issue. 

This algorithm presents the following set of improve¬ 
ments over that of Skilling & Bryan: (i) a more general pe¬ 
nalising functions than entropy is considered (e.g. entropy 
with floating prior or quadratic penalising function) which 
yield a different metric derived heuristically. This yields al¬ 
most the same optimisation sub-space but from a different 
approach; (ii) truncated SVD is implemented to avoid ill- 
conditioned problems in this minimisation sub-space. 


4 SIMULATIONS 

4.1 Specificities of stellar disk inversion 

4 . 1.1 Models of azimuthal velocity distributions. 


Simulated azimuthal velocity distributions can be con- 
struc ted vi a the prescription described in Pichon & Lynden- 
Bell (1996). The construction of Gaussian line profiles com¬ 
patible with a given temperature requires specifying the 
mean azimuthal velocity of the flow, (n^) on which the Gaus¬ 
sian should be centered, the surface density, E(i?) and the 
azimuthal velocity dispersion a^. The line profile F then 
reads 


F^{R,v^) = exp 

V 27r C70 


K - W)]' 

2cr? 


(40) 
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Figure 7. Standard deviation for each SNR out of the 40 restora¬ 
tions carried out displayed as in Fig. ^ From top to bottom: 
SNR = 5, SNR = 30 and SNR = 100. Abscissa is normalised 
specific energy rj and ordinate is specific angular momentum h. 
Note that the maximum residual error is well below the signal’s 
amplitude and decreasing with increasing SNR. 
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Figure 4. Synopsis of the multiple directions multidimensional minimisation algorithm with adjustment of the regularisation weight. In 
this algorithm, /min > 0 is a small threshold used to avoid negative values, 0 < e <C 1 is a small value used to check convergence. 


Here the azimuthal velocity dispersion is related to the az¬ 
imuthal pressure, by 

al = - {V4,f . (41) 


The azimuthal pressure follows from the equation of ra¬ 
dial support, 


( 7,2 \ d{R'Eaf) 


T.dR 


(42) 


and the kinematical “temperature ” of the disk with a given 
Toomre number Q (Toomre, 1964). 


Q = 0.298 (JuK / E . 


(43) 


The expression of the average azimuthal velocity, (v^), may 
be taken to be that which leads to no asymmetric drift equa¬ 
tion: 


E (v^)^ = p<t> - Pr{hR / 2uc)^ , 


(44) 


where k is the epicyclic frequency, and Vc the velocity of 
circular orbits. Equations (^^-(^^ provide a prescription 
for the Gaussian azimuthal line profile F^. These azimuthal 
velocity distributions are used throughout to generate sim¬ 
ulated data corresponding to iso-Q Kuzmin disk. 


4.r2 The counter-rotating radial orbits 

As shown in Fig. |^, the azimuthal distributions of our mod¬ 
els have Gaussian tails corresponding to stars on almost ra¬ 
dial orbits with small negative azimuthal velocity. These few 
stars play a strong dynamical role in stabilising the disk, and 
as such should not be overlooked since they significantly in¬ 
crease the azimuthal dispersion of inner orbits, effectively 
holding the inner galaxy against its self-gravity. Now this 
Gaussian tail translates in the momentum, reduced energy 
space as a small group of counter rotating orbits introduc¬ 
ing a cusp in the number of stars near h = 0 (this cusp 
is only apparent since the distribution is clearly continuous 
and differentiable across this line). In practice the regular¬ 
isation constraint across h — 0 is relaxed, treating in effect 
independently the two regions. 


4.2 Validation & efficiency 
4-2.1 Quality estimation 

Clearly the quality level for the reconstructed distribution 
will depend upon the application in mind. For stability anal¬ 
ysis, the relevant information involves for instance its gra¬ 
dients in action space. An acute quality estimator would 
therefore involve such gradients, though their computation 
requires some knowledge of the orbital structure of the 
disk, and is beyond the scope of this paper. Here the qual- 
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Figure 5. (from top-left to bottom-right): (1) true distribution (approximately that of a disk with Toomre parameter Q = 1.25) and 
mean restoration of f{r],h) out of 40 itterations for a SNR of 5 (2), 30 (3) and 100 (4). Abscissa is normalised specific energy 77 and 
ordinate is specific angular momentum h. Note that the isocontours are not sampled uniformly in order to display more acurately counter 
rotating stars. 


ity of the reconstruction is estimated while computing the 
mean distribution-weighted residual between the distribu¬ 
tion sought and the model recovered. It is defined by: 


error(f)=(|/ - /true|)i; 


/true,i I/i /true,i 




true,i 


(45) 


and measures the restored distribution error with respect 
to the true distribution /true(?7, h) averaged over the stars 
(i.e. weighted by the distribution /true), a set of simulations 
displaying this estimate for the quality of the reconstruction 
was carried while varying respectively the outer sampled 
edge of the disk, the signal to noise ratio, the sampling in the 
modeled distribution and the Q number of the underlying 
data set and is described bellow. 


4.-2.2 Validation: zero noise level inversion 

An inversion without any noise is first carried in order to 
assess the accuracy of our inversion routine. This turned 
out to be more difficult than doing the inversion with some 
knowledge of the noise level since in this instance there is 
no simple assessment of a good value for the Lagrange mul¬ 
tiplier /r. All the ill conditioning arises because of round off 
errors alone. The original distribution was eventually recov¬ 
ered in this manner with a mean distribution-weighted resid¬ 
ual, error(f), smaller than one part in ten to the four. From 
now on, the distribution f{r], h) derived from this noise-free 


inversion is taken in our simulations as the “true” underlying 
distribution. 


4-2.3 Choiee of the penalising funetion 

Let us now investigate the penalising functions correspond¬ 
ing to three methods of regularisation, namely MEM with 
uniform prior (as advocated by S^, MEM with smooth 
floating prior (given by Eqs. (pl[)-(p^)) and quadratic reg¬ 
ularisation (Eq. (p^)). The corresponding modelled and re¬ 
covered distribution functions are given in Figs. ^ and p[ 
From these hgures it appears clearly that MEM with uni¬ 
form prior is unsuitable in this context (this failure is ex¬ 
pected since here - in contrast to image reconstruction - no 
cutoff frequency forbits the roughness of the solution), while 
MEM with floating prior or quadratric penalizing function - 
which both enforces smoothness, provide similar results and 
yield a satisfactory level of regularisation. In particular, no 
qualitative difference occur owing to the penalising function 
alone, which is good indication that the inversion is carried 
adequately. Note that the apparent cusp at /i = 0 is well 
accounted for by the inversion. 

Regularisation by negentropy with a floating smooth 
prior was used in the following simulations. 
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Figure 6. true distribution and one sample for each SNR out of the 40 restorations carried out, displayed as in Fig. Abscissa is 
normalised specific energy rj and ordinate is specific angular momentum h. Comparison with Fig. ^ shows that the inversion is successful! 
both statistically and on a per sample basis. 


4-2.4 Efficiency: the influence of the noise level 


In the second part of the simulations, the performances of 
the proposed algorithm with respect to noise level is investi¬ 
gated. The noise is assumed to obey a Normal distribution 
with standard deviation given by: 




P. . 

^ 1,3 

SNR 


+ max(F) (Tbg . 


(46) 


In other words,the intrinsic data noise has a constant signal 
to noise ratio, SNR, and that the detector adds a uniform 
readout noise. Two sets of runs corresponding respectively 
to a constant signal to noise of SNR = 5, SNR = 30, and 
SNR — 100 are presented in Fig. ^ (mean recovered f), Fig. ^ 
(sample f) and Fig. ^ (standard deviation). In all cases, the 
readout noise level is Ubg = 10““^. The figures only display 
the inner part of the distribution while the simulation car¬ 
ries the inversion for h in the range [—2, 3[ and all possible 
energies. 

The main conclusion to be drawn from these figures is 
that the main features - both qualitatively and quantita¬ 
tively given the noise level - of the distribution are clearly 
recovered by this inversion procedure. Note that near the 
peak of the distribution at /i = 0, ?7 = 0, the recovered 
distribution is nonetheless slightly rounder than its original 
counterpart for the noisier {SNR = 5, panel 2) simulation. 
This is a residual bias of the re-parametrisation: the sought 
distribution is effectively undersampled in that region and 
the regularisation truncates the residual high frequency in 
the signal while incorrectly assuming that it corresponds to 
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noise. If the sampling had been tighter in that region, say 
using regular sampling in exp(—r;), the regularisation would 
not have truncated the restored distribution. Alternatively, 
in order to retain algebraic kernels, un-even logarithmic sam¬ 
pling in the spline basis is an option. 

This point illustrates the danger of non-parametric in¬ 
versions which clearly provide the best approach to model 
fitting but leave open some level of model-dependent tuning 
and consequently potential flaws when the wrong assump¬ 
tions are made on the nature of the sought solution for low 
signal to noise. For instance, the above described procedure 
would inherently ignore any central cusp in the disk if the 
sampling in parameter space is too sparse in that region, 
even if the signal to noise level is adequate to resolve the 
cusp. Since in practice a systematic oversampling is compu¬ 
tationally onerous given the dimensionality of the problem, 
special care should be taken while deciding what an ade¬ 
quate sampling and parametrisation involves. 

Finally, Fig. gives the evolution of the fit error signal 
to noise ratio for various Toomre parameters Q. This figure 
illustrates that the method is independent of the model disk, 
dynamically cold or hot. 


4-2.5 Efficiency: sampling in the model 

The best sampling of the phase space of / must be derived 
considering that two opposite criteria should be balanced: 
(i) using too few basis functions would bias the solution, 
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Figure 8. Minimum value of the normalised likelihood term 
X^/Aldata that can be reached as the number of basis functions 
varies and for different signal-to-noise ratios. In abscissa is given 
the number of samples along r] and h, which is the square root of 
the number of basis functions used. The number of data measure¬ 
ments were 50 X 50 and the maximum disk radius was Umax = 7. 
The curves are only here to clarify the figure: the simulation re¬ 
sults are plotted as symbols. 

(ii) using more basis functions consumes more CPU time. A 
simple and intuitive way to check that the sampling rate is 
sufficient is to insure that the minimum likelihood reached 
without regularisation is much smaller than the target like¬ 
lihood, i.e. lim^^o i(f) ^ Ng. Unregularised inversions of 
noisy data with an increasing number of basis functions and 
signal-to-noise ratios was therefore performed. In practice, 
since completely omitting the regularisation leads to a diffi¬ 
cult minimisation problem because of a large number of local 
minima, regularisation was instead relaxed by using a tar¬ 
get likelihood somewhat lower than the number of measure¬ 
ments {Ne ~ 0.1 X A^data). The results of these simulations 
are displayed in Fig. ^ It appears that ~ 150 x 150 basis 
functions are sufficient to avoid the sampling bias. In all the 
other simulations, 150 x 150 or 200 x 200 basis functions are 
used. 

4-2.6 Efficiency: truncation in the measurements 

The inversion algorithm presented here makes no assump¬ 
tion about completeness of the input data set. Therefore, the 
recovered solution / can in principle be used to predict miss¬ 
ing values in - in contrast to direct inversion methods 
which assume that is known everywhere. Real data will 
always be truncated at some maximum radius R < Rmax- 
There may also be missing measurements for instance due 
to dust clouds which hide some parts of the disk, or depar¬ 
ture from axial symmetry corresponding to spiral structure. 
In order to check how extrapolation proceeds, various trun¬ 
cated data were simulated sets and carried the inversion. 
Figure ^ shows the departure of the recovered distributions 
from the true one as a function of the outer radius Rmax up 
to which data is measured. This figure also shows that our 
inversion allows some extrapolation since for all signal-to- 
noise ratios considered, the error reaches it minimum value 
as soon as Rmax > 7 (i.e. 4 half mass radii. Re, to be com¬ 
pared to the true disk radius which was 10 in our simula- 


3456789 10 

maximum measured radius 

Figure 9. Evolution of the fit error as the data set is truncated in 
radius for different signal-to-noise ratios (in the simulations the 
disk outer edge was assumed to be 10; the curves are only here to 
guide the eye: the results of simulation are plotted as symbols). 



signal to noise ratio SNR 

Figure 10. Fit error versus SNR for various Toomre parameters 
Q. The fit error approximatively decreases as: error ~ 1.0 X 10“®-|- 
3.4 X 10“^/S?VR (solid line curve). 

tions). Note that interpolation is likely to be more reliable 
than extrapolation; our method should therefore be much 
less sensitive to data “holes”. 


5 DISCUSSION & CONCLUSION 

This paper presented a series of practical algorithms to ob¬ 
tain the distribution function / from the measured distri¬ 
butions F,f,{R,v^), compared those to existing algorithms 
and described in details the best suited to carry efficient in¬ 
version of such ill-conditioned problems. It was argued that 
non-parametric modelisation is best suited to describe the 
underlying distribution functions when no particular phys¬ 
ical model is to be favoured. For these inversions, regulari¬ 
sation is a crucial issue and its weight should be tuned “on 
the fly” according to the noise level. 

The mimimization algorithms brought forward in sec¬ 
tion ^ are fairly general and could clearly be implemented 
to minimization problems corresponding to other geometries 
such as that corresponding to the recovery of distributions 
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for spheroid or elliptical galaxies explored by other authors 
(e.g. Merritt & Tremblay (|^)). More generally it could be 
applied to any linear inversion problem where positivity is 
an issue; this includes image reconstruction, all Abel depro¬ 
jection arising in astronomy, etc. Applying this algorithm 
to simulated noisy data, it was found that the criteria of 
positivity and smoothness alone are sufficiently selective to 
regularize the inversion problem up to very low signal-to- 
noise ratios {SNR ~ 5) as soon as data is available up to 4 
Re. The inversion method described here is directly appli¬ 
cable to published measurements. 

Here the inversion assumed that the HI rotation curve 
gives access to an analytic (or spline) form for the potential. 
A more general procedure should provide a simultaneous re¬ 
covery of the potential, though such a routine would be very 
CPU intensive since chanching the potential requires to re¬ 
compute the matrix a. Nevertheless, it would be straight¬ 
forward to extend the scope of this method to configura¬ 
tions corresponding to an arbitrary slit angle as sketched in 
the Appendix or to data produced by integral field spec - 
troscopy [such as TIGRE or OASIS (Bacon et ah, 1995)] 
where the redundancy in azimuth would lead to higher sig¬ 
nal to noise ratios if the disk is still assumed to be flat and 
axi-symmetric. 

Once the distribution function has been characterised, 
it is possible to study quantitatively all departures from the 
flat axisymmetric stellar models. Indeed, axisymmetric dis¬ 
tribution functions are the building blocks of all sophisti¬ 
cated stability analyses, and a good phase space portrait of 
the unperturbed configuration is clearly needed in order to 
asses the stability of a given equilibrium state. Numerical 
N-body simulations require sets of initial conditions which 
should reflect the nature of the equilibrium. Linear stability 
analysis also rely on a detailed knowl edge of the underlying 
distribution( Pichon & Cannon, 1997). 
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APPENDIX A: BILINEAR INTERPOLATION 

In this non-parametric approach, the distribution /(tj, h) is 
described by its projection onto a basis of functions. If we 
choose a basis for which the two variables rj and h are sep¬ 
arable then Eq. (^ becomes: 

fiv, h) = EE fk,iUk{ri)vi{h) , (Al) 

k I 

where Uk{ri) and vi{h) are the new basis functions. This 
description of f{ri,h) yields: 

Fip{Ri, ^(^j) ~ ~ ^ ^ ^ ^ tli^j^k.l fk,l , (-'^^) 

k I 

where are coefficients which only depends on Ri, v^. 

and tjji {tf{R) in fact): 

ai,j,k,i = vi{RiV^.)D-2e,a,inij [ —dt;. (A3) 

Jvcij 
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Bilinear interpolation is implemented in the simulations de¬ 
scribed in section 4 to evaluate f{ri,h) everywhere. In this 
case, the weights fk,i are the values of the distribution at 
the sampling positions {{rjk, hi); k = 1,..., K;l — 1,..., L}: 

fk,i = f{Vk,hi) , 


1 - 

1 V-Vk 1 

1 ^ 1 

if rik-i <rj < rjk+i, 

0 

otherwise. 

1 - 

1 1 

1 1 

if hi-i < h< hi+i, 

0 


otherwise. 


and the basis functions are linear splines: 

Uk{v) = 
vi{h) = 

with r]k+n = rjk+n Arj and hi+n = hi+n/Sh. The bilinear in¬ 
terpolation is a particular case of the general non-parametric 
description. It yields a very sparse matrix a which can sig- 
nihcantly speed up matrix multiplications. The coefficients 
o-i,j,k,i can be computed analytically, though since the ba¬ 
sis functions are dehned piecewise, the integration much be 
performed piecewise: 


:d77= i cl+a'^ fork = 2 


Vv-v^ 


for k = 1, 
for fc = 2,. 
for k = K, 




with: 

= 


rmin{>7fc,l} 


Ukjri) 

max{77(,_i.77c} ^ 


dr], 


and 

// 

Q!fc — 


iv-3Vk-i + 2r]c) 

max{7)j,,7)c} ^ 

'0 


- r 7 =min{7)^,1} 

- '7=max{77fc_i,77c} 


if r]k<Vc, 
otherwise. 


dr?. 


^'^^ i^Vk+i-V-‘2Vc) 


if r^fc+i <7?c 
otherwise. 


7 ) = min{ 7 )s, 4.1,1} 

77 = max{7)fc,77c} 

Another useful feature of bilinear interpolation is that 


the positivity constraint is straightforward to implement: 


f{v,h) 


-E 

k,i 


fk,tUk{r])vi{h) > 0;W{T],h) ^ fk,i > 0;V(/c,/). 


There is no such simple relation for higher order splines. 


APPENDIX B: SPECIFIC MINIMISATION 
METHODS FOR MEM 


Several non-linear methods have been derived specihcally to 
seek the maximum entropy solution. Let us review those 
brie fly so as to compare them with our method (sec¬ 
tion 3.2.2). In MEM, assuming that (i) the prior p does not 
depend on the parameters and that (ii) the Hessian of the 
likelihood term can be neglected, the Hessian of Q is then 
purely diagonal: 


y^Qk,i — V\/Rk,i 


]rSk,i 

IT 


The direction of minimisation is therefore: 



Figure Cl. Angular notations 


SfMEMk = — — \7Qk ■ 


(Bl) 


Skilling & Bryan (1984) discussed further refin ements to 
speed up convergence. Cornwell & Evans ( 1984 ) approxi¬ 
mated the Hessian VVQ by neglecting all non-diagonal ele¬ 
ments: 


VVQfe,i ~ 5k,i-^ -I- 5k,i 


i,3,k 


fk 




Yax{Fi,j 


which yields: 
5fcEk = - 


fk ^Qk 


h + fk 


JVar(E,.,.) 


(B2) 


In fact, 5fcEk is equivalent to a steepest descen t step in the 
Levenberg-Marquart method (Press et ah, 1988) which is the 
method of predilection to fit a parametric non-linear model. 
Extending the Richardson-Lucy metho d to the maximum 
penalised likelihood regime, Lucy (199'1) suggests: 


5/Lucyj, — fk 


'VQk — 


El fi^Qi 
Eifi 


(B3) 


which is almost the same as in classical MEM but for the 
term Ef fi^Qi / Eg fi which accounts for the constraint 
that Ek fk should remain constant. Note that it is suffi¬ 
cient to replace VQ*, in Eq. @ by VQk - El El 

to apply a further constraint of normalisation. 

With all these non-linear methods, it may be advanta- 
geous to also seek the step size which minimises Q(f -I- A5f) 
ornwell & Evans, 1984; Lucy, 1994). 




APPENDIX C: GENERAL MODEL WITH 
ARBITRARY SLIT ORIENTATION 

Long slit spectroscopic observations of a galactic disk pro¬ 
vide the distribution: 

FER,v//)= J f{e,h)dv^, (Cl) 

where v// and v± are the star velocities (at intrinsic radius,r 
and projected radius R) along and perpendicular to the line 
of sight respectively, that are related to the radial and az¬ 
imuthal velocities: 

vr = ciV//+ C2V± , v,], = C3V/f + CiV± and R = cs r . 

Here the Ck depend on the angle a between the slit and the 
major axis of the disk as measured in the plane of the sky 
and on the inclination i of the disk axis with respect to the 
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line of sight (see Fig. Cl). The case where the slit is parallel 
to the major axis of the disk, i.e. a = 0, as been examined in 
the main text. When a ^ 0, the specific angular momentum 
h = rv^ can be used as variable of integration: 


FciR,v//) 


C5 

Rca 


f{£,h)dh. 


(C2) 


where the specihc energy and the integration bounds are: 





hn 


In practice, straightforward trigonometry yields 

Cl = sin(/3)/sin(i), C 2 = cos(/3), 

C 3 = cos(/I)/sin(i), C 4 = sin(/3), 

C5 = \/ cos2(/3) + sm^(j3) sin^(i) , 

where (3 is the angle of the slit as measured in the plane of 
the disk which obeys 


tan(/3) = tan(a)/sin(i) . 
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